# Load (and install if missing) the packages used throughout
pacman::p_load(tidyverse, broom, kableExtra)
A/B Testing is a method commonly used in Data Science, Analytics, and Software Engineering to compare two or more options to determine which one performs better. It’s essentially an experiment designed to test a hypothesis in a controlled and measurable way.
While the term might not be familiar to everyone, the concept is straightforward, especially for those with a background in statistics. At its core, A/B testing helps answer questions like: Does the new version perform better than the current one? Is the difference we observe real, or just noise?
The “A” and “B” refer to the options being compared—such as a control (existing version) and a variation (new version). Depending on the scenario, this can extend to more than two options.
The tools you use to analyze the data depend on the nature of the experiment. Common statistical techniques include:
T-Test: Compares means between two groups.
ANOVA (Analysis of Variance): Tests differences across more than two groups.
Chi-Square Test (\(\chi^2\)): Examines relationships between categorical variables.
Kruskal-Wallis Test: A non-parametric alternative to ANOVA for comparing the distributions (often summarized by medians) of multiple groups.
A/B testing provides a structured way to test ideas and make data-driven decisions, whether in marketing, product development, healthcare, or other fields.
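Each of the techniques above maps to a base R function. The calls below are purely illustrative, using built-in data sets (`sleep`, `InsectSprays`, `mtcars`) as stand-ins rather than real A/B test data:

```r
# Illustrative calls only; the data sets are stand-ins for real experiment data
t.test(extra ~ group, data = sleep)              # t-test: compare two group means
summary(aov(count ~ spray, data = InsectSprays)) # ANOVA: compare 3+ group means
chisq.test(table(mtcars$cyl, mtcars$am))         # chi-square: categorical association
kruskal.test(count ~ spray, data = InsectSprays) # Kruskal-Wallis: non-parametric
```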
one_sample <- rio::import('https://vincentarelbundock.github.io/Rdatasets/csv/mlmRev/bdf.csv') |> select(-rownames)
DT::datatable(one_sample, options=list(lengthMenu = c(3,10,30)), extensions="Responsive")
A t-test compares the mean of one sample against a hypothesized value, or the means of two samples against each other, to judge whether an observed difference is likely due to chance.
Purpose: Tests whether the mean of a single sample is significantly different from a known or hypothesized value (the perceived mean).
Assumptions:
Random Sample drawn from the population
Data is approximately normally distributed
Data is continuous or treated as approximately interval (e.g., aggregated Likert scales with sufficient responses)
The bdf data frame has 2287 rows and 25 columns of language scores from grade 8 pupils in elementary schools in The Netherlands.
The mean Verbal IQ of an 8th grader is (on this scale) believed to be about 10. I hypothesize that the average Verbal IQ of 8th graders is greater than 10. For this test we will use a 95% confidence level.
If you are interested in the columns for the data it can be found here: Link to data dictionary
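The results table below comes from a one-sample t-test against the hypothesized mean of 10 with a "greater" alternative. A sketch of the call, reusing `one_sample` and the packages loaded above:

```r
# One-sample t-test: is mean Verbal IQ greater than the hypothesized value of 10?
one_result <- t.test(one_sample$IQ.verb, mu = 10,
                     alternative = 'greater', conf.level = 0.95)
tidy(one_result) %>%
  kable(format = "html", caption = "One-Sample Verbal IQ T-Test Results") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```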
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 11.83406 | 42.39452 | 0 | 2286 | 11.76287 | Inf | One Sample t-test | greater |
As seen in the results, with a p-value of approximately 0 there is statistically significant evidence that the mean Verbal IQ of 8th graders is greater than the hypothesized value of 10.
The histogram below illustrates the distribution of the sample data, with a vertical line marking the mean.
plotly::ggplotly(ggplot(one_sample, aes(x = IQ.verb)) +
  geom_histogram(bins = 30) +
  labs(x = 'Verbal IQ Score', y = 'Count of Scores',
       title = 'The distribution of Verbal IQ scores', subtitle = 'Line is the mean') +
  geom_vline(xintercept = mean(one_sample$IQ.verb)))
Purpose: Compares the means of two related groups
Assumptions:
The differences between paired observations should follow a normal distribution
Each pair is independent of the other pairs
Data is continuous or treated as approximately interval (e.g., aggregated Likert scales with sufficient responses)
Each observation in one group corresponds to one related observation in the other (e.g., pre vs post-test, twins)
The ChickWeight data frame has 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks.
I hypothesize that chicks at Day 20 will weigh more than at Day 10.
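Before running the test, the normality-of-differences assumption can be checked. One common (if imperfect) approach is a Shapiro-Wilk test on the paired differences; a minimal sketch using the same ChickWeight subset:

```r
library(tidyr)

# Compute Day 20 minus Day 10 weight for each chick and test the
# differences for approximate normality (Shapiro-Wilk)
wide <- subset(ChickWeight, Time %in% c(10, 20)) |>
  pivot_wider(id_cols = Chick, names_from = Time, values_from = weight)
diffs <- na.omit(wide$`20` - wide$`10`)  # drop chicks missing a measurement
shapiro.test(diffs)  # p > 0.05 would suggest no strong evidence against normality
```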
paired_data <- subset(ChickWeight, Time %in% c(10, 20)) |>
  pivot_wider(names_from = Time, values_from = weight)  # one row per chick, columns `10` and `20`
paired_result <- t.test(paired_data$`10`, paired_data$`20`, paired = TRUE, conf.level = 0.95, alternative = 'less')
tidy_paired <- tidy(paired_result)
tidy_paired %>%
kable(format = "html", caption = "Paired T-Test Results: Chick Weight at Day 10 vs Day 20") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| -100 | -12.9972 | 0 | 45 | -Inf | -87.07854 | Paired t-test | less |
The paired t-test tells us whether there is a statistically significant difference in mean chick weight between Day 10 and Day 20. With a p-value of approximately 0, well below 0.05, there is statistically significant evidence that the chicks weigh more at Day 20 than at Day 10, supporting the hypothesis that they gain weight over the course of the experiment.
The graph below highlights how much of a weight discrepancy there actually is.
plotly::ggplotly(ggplot(subset(ChickWeight, Time %in% c(10, 20)) , aes(x = as.factor(Time), y = weight)) +
geom_boxplot() +
labs(x = 'Day Number', y = "Weight", title = "Weight of Chickens on Day 10 and 20"))
The independent t-test compares the means of two independent groups to determine whether there is a statistically significant difference between them.
Assumptions:
The data consists of two independent samples.
The data in each group is approximately normally distributed.
The variances of the two groups are equal (this assumption can be relaxed with the Welch t-test).
The data is continuous or treated as approximately interval.
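The equal-variance assumption can itself be checked, for example with an F test via `var.test`; a small p-value argues for the Welch (unequal-variance) t-test, which `t.test()` applies by default (`var.equal = FALSE`). A sketch using the same iris comparison analyzed below:

```r
# F test for equality of variances between the two species' petal lengths
two_species <- subset(iris, Species %in% c("setosa", "versicolor"))
two_species$Species <- droplevels(two_species$Species)  # keep only the 2 levels present
var.test(Petal.Length ~ Species, data = two_species)
```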
For this test I will use the iris data set. This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
We will use a 95% confidence level with a two-sided alternative: I hypothesize that the mean petal lengths of the two species are not equal.
independent_result <- t.test(Petal.Length ~ Species, data = iris, subset = Species %in% c("setosa", "versicolor"), alternative = 'two.sided')
tidy_independent <- tidy(independent_result)
tidy_independent %>%
kable(format = "html", caption = "Independent T-Test Results: Petal Length for Setosa vs Versicolor") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
| estimate | estimate1 | estimate2 | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|---|---|
| -2.798 | 1.462 | 4.26 | -39.49272 | 0 | 62.13977 | -2.939618 | -2.656382 | Welch Two Sample t-test | two.sided |
From the result we can see that the p-value is approximately 0. We therefore reject the null hypothesis in favor of the alternative: the two species have different mean petal lengths.
The graph below again illustrates the difference in petal length between the two species.
plotly::ggplotly(ggplot(subset(iris, Species %in% c("setosa", "versicolor")), aes(y = Petal.Length, x = Species)) +
geom_boxplot() +
labs(title = 'The difference in Petal Length between Plant Species'))